-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sse non-temporal loads/stores #11
base: master
Are you sure you want to change the base?
Conversation
clean up perf and merge in decode_perf get rid of unused function warnings
Can you extend the benchmark so that the decoded data is actually accessed immediately rather than written to RAM? Because here is my expectation. People do not grab compressed data from RAM, uncompress it and then push it back to RAM for safe-keeping without accessing it. People decode data and immediately make use to it. You actually want to avoid pushing back to RAM the decoded data. It is kind of counterproductive to hold in memory both the compressed and uncompressed data. My concern is that the performance will actually be worse in an actual application with non-temporal writes. For encoding, it is something else... it would be kind of silly to compress the data and then immediately, as you are still holding the uncompressed data, to uncompress the newly compressed data... So my view is that for compression, then non-temporal writes make sense. I am much less convinced regarding decoding. I think we should measure in a realistic scenario. Thoughts? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you extend the benchmark so that the decoded data is actually accessed immediately rather than written to RAM?
Because here is my expectation. People do not grab compressed data from RAM, uncompress it and then push it back to RAM for safe-keeping without accessing it. People decode data and immediately make use to it. You actually want to avoid pushing back to RAM the decoded data. It is kind of counterproductive to hold in memory both the compressed and uncompressed data.
My concern is that the performance will actually be worse in an actual application with non-temporal writes.
For encoding, it is something else... it would be kind of silly to compress the data and then immediately, as you are still holding the uncompressed data, to uncompress the newly compressed data... So my view is that for compression, then non-temporal writes make sense.
I am much less convinced regarding decoding.
I think we should measure in a realistic scenario.
Yes, it doesn't leave the data in cache obviously. Unfortunately the non-aligned compressed data is hard to write non-temporally. Maybe it could be packed into a buffer register or two and flushed in 16 byte units, but it's a headache. |
What I'd suggest, if we are going to merge this, is to make it an option, but not necessarily the default. |
|
Switching SSE to non-temporal store gives a large speedup (~50% on my i5) in decoding. Non-temporal loads don't seem to make a difference in encoding. Both require 16-byte alignment, so we may want to make this a build option rather than straight merge.
perf in this branch tests both encoding and decoding in one target.